__author__ = 'Alice Jacques <alice.jacques@noirlab.edu>, NOIRLab Astro Data Lab Team <datalab@noao.edu>'
__version__ = '20201209'
__datasets__ = ['ls_dr8','sdss_dr16','gaia_dr2','des_dr1','smash_dr2','unwise_dr1','allwise']
__keywords__ = ['crossmatch','joint query','mydb','vospace','image cutout']
by Alice Jacques and the NOIRLab Astro Data Lab Team
The crossmatch tables at Astro Data Lab are structured as followed:
schema1.xNpN__table1__schema2__table2
where the N in NpN encode the numerical value of the crossmatching radius (since dots '.' are not allowed in table names).
Example:
ls_dr8.x1p5__tractor_primary_n__gaia_dr2__gaia_source
is a crossmatch table (indicated by the leading x), located in the ls_dr8 schema, and it crossmatches the ls_dr8.tractor_primary_n table with the gaia_dr2.gaia_source table (which lives in the gaia_dr2 schema) within a 1.5 arcseconds radius ('1p5') .
This is admittedly long, but clean, consistent, and most importantly, parsable. The use of double-underscores '__' is to distinguish from single underscores often used in schema and table names.
All crossmatch tables shall be minimalist, i.e. have only these columns: id1,ra1,dec1,id2,ra2,dec2,distance. Column descriptions in the crossmatch table shall contain the original column names in parentheses (makes it parsable).
For example:
ls_dr8.x1p5__tractor_primary_n__gaia_dr2__gaia_source
| Column | Description | Datatype |
|---|---|---|
| id1 | ID in left/first table (ls_id) | BIGINT |
| ra1 | Right ascension in left/first table (ra) | DOUBLE |
| dec1 | Declination in left/first table (dec) | DOUBLE |
| id2 | ID in right/second table (source_id) | BIGINT |
| ra2 | Right ascension in right/second table (ra) | DOUBLE |
| dec2 | Declination in right/second table (dec) | DOUBLE |
| distance | Distance between ra1,dec1 and ra2,dec2 (arcsec) | DOUBLE |
allwise.x1p5__source__des_dr1__main and des_dr1.x1p5__main__allwise__source.The list of available crossmatch tables can be viewed on our query interface here under their respective schema.
If you use this notebook for your published science, please acknowledge the following:
Data Lab concept paper: Fitzpatrick et al., "The NOAO Data Laboratory: a conceptual overview", SPIE, 9149, 2014, http://dx.doi.org/10.1117/12.2057445
Data Lab disclaimer: https://datalab.noao.edu/disclaimers.php
# std lib
from getpass import getpass
# 3rd party
from astropy.utils.data import download_file #import file from URL
from matplotlib.ticker import NullFormatter
# Data Lab
from dl import authClient as ac, queryClient as qc, storeClient as sc
from dl.helpers.utils import convert # converts table to Pandas dataframe object
Much of the functionality of Data Lab can be accessed without explicitly logging in (the service then uses an anonymous login). But some capacities, for instance saving the results of your queries to your virtual storage space, require a login (i.e. you will need a registered user account).
If you need to log in to Data Lab, issue this command, and respond according to the instructions:
#ac.login(input("Enter user name: (+ENTER) "),getpass("Enter password: (+ENTER) "))
ac.whoAmI()
We can use Data Lab's Query Client to access the pre-crossmatched tables hosted by Data Lab. First let's get a total count of the number of objects (nrows) in SDSS DR16 that are also in LS DR8:
%%time
query="SELECT nrows FROM tbl_stat WHERE schema='sdss_dr16' and tbl_name='x1p5__specobj__ls_dr8__tractor_primary'"
# Call query manager
response = qc.query(sql=query)
print(response)
Now let's print just the first 100 rows:
query = "SELECT * FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary LIMIT 100"
response = qc.query(sql=query)
convert(response) # convert the table into a Pandas dataframe object
In order to extract only the relevant columns pertaining to our science question from multiple data tables, we may write a query that uses a JOIN statement. There are 4 main types of JOIN statements that we could use, and which one we decide to choose depends on how we want the information to be extracted.
Take a moment to look over the figure below outlining the various JOIN statement types.
NOTE: the default JOIN is an INNER JOIN.

JOIN LATERAL¶In nearest neighbor crossmatch queries, we use JOIN LATERAL, which is like a SQL foreach loop that will iterate over each row in a result set and evaluate a subquery using that row as a parameter.
Here we will examine spectroscopic redshifts from SDSS DR16 and photometry from LS DR8. The two crossmatch tables related to these two catalogs are:
ls_dr8.x1p5__tractor__sdss_dr16__specobj and sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary. The choice of which of these two crossmatch tables to use should be based on the science question being posed. For instance, the question 'how does a galaxy's structure change with redshift?' is dependent on the redshift values obtained from SDSS DR16, so we should use the crossmatch table that has SDSS DR16 as the first table. So, the relevant information we want from our 3 tables of interest for this example are:
sdss_dr16.x1p5__specobj__ls_dr8__tractor_primarysdss_dr16.specobjls_dr8.tractor^PSF=stellar, REX=round exponential galaxy, DEV=deVauc, EXP=exponential, COMP=composite, DUP=Gaia source fit by different model.
Now that we know what we want and where we want it from, let's write the query and then print the results on screen.
query = ("""SELECT
X.ra1 as ra_sdss,X.dec1 as dec_sdss,
S.z,S.class,S.veldisp,S.veldisperr,
L.ra as ra_ls,L.dec as dec_ls,L.type,L.g_r,L.r_z
FROM sdss_dr16.x1p5__specobj__ls_dr8__tractor_primary as X
JOIN sdss_dr16.specobj as S ON X.id1 = S.specobjid
JOIN ls_dr8.tractor AS L ON X.id2 = L.ls_id
WHERE X.ra1 BETWEEN %s and %s and X.dec1 BETWEEN %s and %s
""") %(110,200,7.,40.) #large region
print(query)
%%time
response = qc.query(sql=query) # default format is a CSV file
result = convert(response) # convert to a Pandas dataframe object
result
VOSpace is a convenient storage space for users to save their work. It can store any data or file type. We can save the results from the same query to our virtual storage space:
response = qc.query(sql=query, fmt='csv', out='vos://testresult.csv')
Let's ensure the file was saved in VOSpace:
sc.ls(name='vos://testresult.csv')
Now let's remove the file we just saved to VOSpace:
sc.rm (name='vos://testresult.csv')
Let's ensure the file was removed from VOSpace:
sc.rm (name='vos://testresult.csv')
MyDB is a useful OS remote per-user relational database that can store data tables. Furthermore, the results of queries can be directly saved to MyDB, as we show in the following example:
response = qc.query(sql=query, fmt='csv', out='mydb://testresult')
Ensure the table has been saved to MyDB by calling the mydb_list() function, which will list all tables currently in a user's MyDB:
print(qc.mydb_list(),"\n")
Now let's drop the table from our MyDB.
qc.mydb_drop('testresult')
Ensure it has been removed by calling the mydb_list() function again:
print(qc.mydb_list(),"\n")
We can construct a query to run a crossmatch in the database using the q3c_join() function, which identifies all matching objects within a specified radius in degrees (see details on using Q3C functions). For this example, we will search only for the single nearest neighbor. For different examples of crossmatching, see our How to crossmatch tables notebook.
First, let's query a small selection of sample data from the Data Lab database and store it in MyDB as gaia_sample. This will act as our "user-provided table".
%%time
query = """SELECT source_id,ra,dec,parallax,pmra,pmdec FROM gaia_dr2.gaia_source
WHERE ra<200 AND ra>124 AND random_id<10 LIMIT 70000"""
response = qc.query(sql=query)
convert(response)
qc.query(sql=query,out='mydb://gaia_sample',drop=True)
Next let's crossmatch our gaia_sample table with Data Lab's pre-crossmatched table between SMASH DR2 and allWISE smash_dr2.x1p5__object__allwise__source. We'll write our crossmatch query using the q3c_join() function as well as the q3c_dist() function, searching for the nearest neighbor within a 1.5 arcsec radius (which must be converted into degrees for the query, so we divide by 3600.0). We will then save it in MyDB as gaia_sample_xmatch.
%%time
qu = """SELECT
G.source_id,ss.id1,ss.id2,G.ra,G.dec,ss.ra1,ss.dec1,ss.ra2,ss.dec2,
(q3c_dist(G.ra,G.dec,ss.ra1,ss.dec1)*3600.0) as dist_arcsec
FROM mydb://gaia_sample AS G
JOIN LATERAL (
SELECT S.id1,S.id2,S.ra1,S.dec1,S.ra2,S.dec2
FROM
smash_dr2.x1p5__object__allwise__source AS S
WHERE
q3c_join(G.ra,G.dec,S.ra1,S.dec1, 1.5/3600.0)
ORDER BY
q3c_dist(G.ra,G.dec,S.ra1,S.dec1)
ASC LIMIT 1
) AS ss ON true;
"""
resp = qc.query(sql=qu)
convert(resp)
qc.query(sql=qu,out='mydb://gaia_sample_xmatch',drop=True)
Now we can write a query using the JOIN statement in order to extract the columns we want from our tables of interest. Just as in the previous section, let's first make an outline of which tables we'd like to extract columns from.
mydb://gaia_sample_xmatchsmash_dr2.objectallwise.source mydb://gaia_samplequery = ("""SELECT
X.source_id,X.id1,X.id2,X.ra,X.dec,
s.gmag,s.rmag,s.zmag,
a.w1mpro,a.w2mpro,a.w3mpro,
g.parallax,g.pmra,g.pmdec
FROM mydb://gaia_sample_xmatch as X
JOIN smash_dr2.object as s ON X.id1 = s.id
JOIN allwise.source AS a ON X.id2 = a.cntr
JOIN mydb://gaia_sample AS g ON X.source_id = g.source_id
""")
print(query)
response = qc.query(sql=query) # default format is a CSV file
convert(response)
Here we compare the speed of using the q3c_join() function to crossmatch directly in a JOIN query (query1) versus using a pre-crossmatched table in a JOIN query (query2). We select objects from the two catalogs and retrieve the same specified columns for the two queries. We will see that query1 times out after 300 seconds (5 minutes) and fails to retrieve results, while query2 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 3.6 million rows we queried for.
%%time
query1 = """SELECT
a.cntr as id1,a.ra as ra1,a.dec as dec1,a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
gg.specobjid as id2,gg.ra as ra2,gg.dec as dec2,gg.z,gg.class,gg.veldisp,gg.veldisperr,
(q3c_dist(a.ra,a.dec,gg.ra,gg.dec)*3600.0) as dist_arcsec
FROM
allwise.source AS a
INNER JOIN LATERAL (
SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
FROM
sdss_dr16.specobj AS s
WHERE
q3c_join(a.ra, a.dec, s.ra, s.dec, 1.5/3600.0)
ORDER BY
random()
ASC LIMIT 1
) as gg ON true
"""
resp1 = qc.query(sql=query1,timeout=300)
convert(resp1)
%%time
query2 = """SELECT
X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
a.pmdec,a.pmra,a.w1mpro,a.w2mpro,
s.z,s.class,s.veldisp,s.veldisperr
FROM
allwise.x1p5__source__sdss_dr16__specobj as X
JOIN
allwise.source as a ON X.id1 = a.cntr
JOIN
sdss_dr16.specobj AS s ON X.id2 = s.specobjid
"""
resp2 = qc.query(sql=query2)
convert(resp2)
We again select objects from two catalogs and retrieve the same specified columns for two queries. query3 uses a pre-crossmatched table in a JOIN query and query4 crossmatches directly in the JOIN query. We will see that query3 runs for about 60-90 seconds (1-1.5 minutes) and will retrieve the 4.4 million rows we queried for, while query4 times out after 300 seconds (5 minutes) and fails to retrieve results.
%%time
query3 = """SELECT
X.id1,X.id2,X.ra1,X.dec1,X.ra2,X.dec2,X.distance as dist_arcsec,
u.mag_w1_vg,u.mag_w2_vg,s.z,s.class,s.veldisp,s.veldisperr
FROM
unwise_dr1.x1p5__object__sdss_dr16__specobj as X
JOIN
unwise_dr1.object as u ON X.id1 = u.unwise_objid
JOIN
sdss_dr16.specobj AS s ON X.id2 = s.specobjid
ORDER BY
random()
"""
resp3 = qc.query(sql=query3,timeout=300)
convert(resp3)
%%time
query4 = """SELECT
u.unwise_objid as id1,u.ra as ra1,u.dec as dec1,u.mag_w1_vg,u.mag_w2_vg,
ss.specobjid as id2,ss.ra as ra2,ss.dec as dec2,ss.z,ss.class,ss.veldisp,ss.veldisperr,
(q3c_dist(u.ra,u.dec,ss.ra,ss.dec)*3600.0) as dist_arcsec
FROM
unwise_dr1.object AS u
INNER JOIN LATERAL (
SELECT s.specobjid,s.ra,s.dec,s.z,s.class,s.veldisp,s.veldisperr
FROM
sdss_dr16.specobj AS s
WHERE
q3c_join(u.ra, u.dec, s.ra, s.dec, 1.5/3600.0)
ORDER BY
random()
ASC LIMIT 1
) as ss ON true
"""
resp4 = qc.query(sql=query4,timeout=300)
convert(resp4)
def make_cutout_comparison_table(ra_in1, dec_in1, ra_in2, dec_in2):
"""
Obtain unWISE DR1 and LS DR8 color JPEG images from Legacy Survey team cutout tool at NERSC
"""
un_img = []
ls_img = []
for i in range(len(ra_in1)):
un_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=unwise-neo6&pixscale=0.3' % (ra_in1[i],dec_in1[i])
img = plt.imread(download_file(un_cutout_url,cache=True,show_progress=False,timeout=120))
un_img.append(img)
ls_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=ls-dr8&pixscale=0.3' % (ra_in2[i],dec_in2[i])
img = plt.imread(download_file(ls_cutout_url,cache=True,show_progress=False,timeout=120))
ls_img.append(img)
return un_img,ls_img
def plot_cutouts(img1,img2,cat1,cat2):
"""
Plot images in two rows with 5 images in each row
"""
fig = plt.figure(figsize=(21,7))
for i in range(len(img1)):
ax = fig.add_subplot(2,6,i+1)
ax.imshow(img1[i])
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.tick_params(axis='both',which='both',length=0)
ax.text(0.02,0.93,'ra=%.5f'%list_ra1[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.85,'dec=%.5f'%list_dec1[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.77,cat1,transform=ax.transAxes,fontsize=12,color='white')
ax = fig.add_subplot(2,6,i+7)
ax.imshow(img2[i])
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.tick_params(axis='both',which='both',length=0)
ax.text(0.02,0.93,'ra=%.5f'%list_ra2[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.85,'dec=%.5f'%list_dec2[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.77,cat2,transform=ax.transAxes,fontsize=12,color='white')
plt.subplots_adjust(wspace=0.02, hspace=0.03)
... then save them as arrays.
%%time
q = """SELECT ra1,dec1,ra2,dec2
FROM unwise_dr1.x1p5__object__ls_dr8__tractor_primary
WHERE ra1>300 AND dec1>33 ORDER BY random() LIMIT 5"""
r = qc.query(sql=q,fmt='pandas')
list_ra1=r['ra1'].values # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values
list_dec2=r['dec2'].values
cat1='unWISE'
cat2='ls dr8'
un_img,ls_img = make_cutout_comparison_table(list_ra1,list_dec1,list_ra2,list_dec2)
plot_cutouts(un_img,ls_img,cat1,cat2)
Here we will compare two images of the same object from two different catalogs, SDSS and DES DR1.
def make_cutout_comparison_table2(ra_in1, dec_in1, ra_in2, dec_in2):
"""
Obtain SDSS and DES DR1 color JPEG images from Legacy Survey team cutout tool at NERSC
"""
sd_img = []
de_img = []
for i in range(len(ra_in1)):
sd_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=sdss&pixscale=0.25' % (ra_in1[i],dec_in1[i])
img = plt.imread(download_file(sd_cutout_url,cache=True,show_progress=False,timeout=120))
sd_img.append(img)
de_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=des-dr1&pixscale=0.25' % (ra_in2[i],dec_in2[i])
img = plt.imread(download_file(de_cutout_url,cache=True,show_progress=False,timeout=120))
de_img.append(img)
return sd_img,de_img
... then save them as arrays.
%%time
q = """SELECT ra1,dec1,ra2,dec2
FROM sdss_dr16.x1p5__specobj__des_dr1__main
ORDER BY random() LIMIT 5"""
r = qc.query(sql=q,fmt='pandas')
list_ra1=r['ra1'].values # ".values" convert to numpy array
list_dec1=r['dec1'].values
list_ra2=r['ra2'].values
list_dec2=r['dec2'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table2(list_ra1,list_dec1,list_ra2,list_dec2)
plot_cutouts(sd_img,de_img,cat1,cat2)
qc.mydb_import('gals','./gals.csv',drop=True)
def make_cutout_comparison_table3(ra_in, dec_in):
"""
Obtain SDSS and DES DR1 color JPEG images from Legacy Survey team cutout tool at NERSC
"""
sd_img = []
de_img = []
for i in range(len(ra_in)):
sd_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=sdss&pixscale=0.5' % (ra_in[i],dec_in[i])
img = plt.imread(download_file(sd_cutout_url,cache=True,show_progress=False,timeout=120))
sd_img.append(img)
de_cutout_url = 'https://www.legacysurvey.org/viewer/cutout.jpg?ra=%g&dec=%g&layer=des-dr1&pixscale=0.5' % (ra_in[i],dec_in[i])
img = plt.imread(download_file(de_cutout_url,cache=True,show_progress=False,timeout=120))
de_img.append(img)
return sd_img,de_img
def plot_cutouts2(img1,img2,cat1,cat2):
"""
Plot images in two rows with 5 images in each row
"""
fig = plt.figure(figsize=(21,7))
for i in range(len(img1)):
ax = fig.add_subplot(2,6,i+1)
ax.imshow(img1[i])
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.tick_params(axis='both',which='both',length=0)
ax.text(0.02,0.93,'ra=%.5f'%list_ra[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.85,'dec=%.5f'%list_dec[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.77,cat1,transform=ax.transAxes,fontsize=12,color='white')
ax = fig.add_subplot(2,6,i+7)
ax.imshow(img2[i])
ax.xaxis.set_major_formatter(NullFormatter())
ax.yaxis.set_major_formatter(NullFormatter())
ax.tick_params(axis='both',which='both',length=0)
ax.text(0.02,0.93,'ra=%.5f'%list_ra[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.85,'dec=%.5f'%list_dec[i],transform=ax.transAxes,fontsize=12,color='white')
ax.text(0.02,0.77,cat2,transform=ax.transAxes,fontsize=12,color='white')
plt.subplots_adjust(wspace=0.02, hspace=0.03)
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5 OFFSET 5"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)
qg = """SELECT ra,dec FROM mydb://gals LIMIT 5 OFFSET 10"""
rg = qc.query(sql=qg)
rp = convert(rg)
list_ra=rp['ra'].values # ".values" convert to numpy array
list_dec=rp['dec'].values
cat1='sdss dr16'
cat2='des dr1'
sd_img,de_img = make_cutout_comparison_table3(list_ra,list_dec)
plot_cutouts2(sd_img,de_img,cat1,cat2)
W3Schools: SQL Joins https://www.w3schools.com/sql/sql_join.asp
Legacy Survey Sky Browser: https://www.legacysurvey.org/viewer#NGC%203098